Distributional Semantics Approach to Detecting Synonyms in Croatian Language

Authors

  • Mladen Karan
  • Jan Šnajder
  • Bojana Dalbelo Bašić
Abstract

Identifying synonyms is important for many natural language processing and information retrieval applications. In this paper we address the task of automatically identifying synonyms in Croatian using distributional semantic models (DSMs). We build several DSMs using latent semantic analysis (LSA) and random indexing (RI) on the large hrWaC corpus. We evaluate the models on a dictionary-based similarity test: a set of synonymy questions generated automatically from a machine-readable dictionary. Results indicate that LSA models outperform RI models on this task, with accuracies of 68.7%, 68.2%, and 61.6% on nouns, adjectives, and verbs, respectively. We analyze how word frequency and polysemy level affect performance and discuss common causes of synonym misidentification.
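The abstract names random indexing and a synonymy-question test but gives no implementation details, so the following is only a minimal sketch of how the two ideas could fit together. Everything specific in it is an assumption made for illustration: the function names (`build_ri_vectors`, `answer_synonymy_question`, `accuracy`), the vector dimension, the number of non-zero entries, the context window, and the use of cosine similarity to pick an answer. It does not reproduce the authors' settings, their hrWaC preprocessing, or the LSA models.

```python
import math
import random
from collections import defaultdict


def build_ri_vectors(sentences, dim=1000, nonzeros=4, window=2, seed=0):
    """Random indexing sketch: each context word gets a sparse random
    index vector; a word's vector is the sum of the index vectors of
    the words that occur within `window` positions of it.
    All parameter values here are illustrative assumptions."""
    rng = random.Random(seed)
    index_vectors = {}  # context word -> {position: +1 or -1}
    word_vectors = defaultdict(lambda: defaultdict(int))

    def index_vector(word):
        # Create the sparse ternary index vector on first use.
        if word not in index_vectors:
            positions = rng.sample(range(dim), nonzeros)
            index_vectors[word] = {p: rng.choice((-1, 1)) for p in positions}
        return index_vectors[word]

    for sentence in sentences:
        for i, target in enumerate(sentence):
            lo = max(0, i - window)
            hi = min(len(sentence), i + window + 1)
            for j in range(lo, hi):
                if j == i:
                    continue
                for pos, sign in index_vector(sentence[j]).items():
                    word_vectors[target][pos] += sign
    return word_vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(value * v.get(pos, 0) for pos, value in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def answer_synonymy_question(vectors, target, candidates):
    """Pick the candidate whose vector is closest to the target's."""
    target_vector = vectors.get(target, {})
    return max(candidates,
               key=lambda c: cosine(target_vector, vectors.get(c, {})))


def accuracy(vectors, questions):
    """Share of (target, candidates, correct) questions answered correctly."""
    hits = sum(answer_synonymy_question(vectors, t, cands) == correct
               for t, cands, correct in questions)
    return hits / len(questions)
```

A call such as `accuracy(build_ri_vectors(tokenized_sentences), questions)` would then give the fraction of synonymy questions answered correctly, where `tokenized_sentences` is a list of token lists and each question is a `(target, candidates, correct)` tuple; both names are placeholders for data the reader supplies.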

Similar resources

Application of Frame Semantics to Teaching Seeing and Hearing Vocabulary to Iranian EFL Learners

A term in one language rarely has an absolute synonymous meaning in the same language; besides, it rarely has an equivalent meaning in an L2. English synonyms of seeing and hearing are particularly grammatically and semantically different. Frame semantics is a good tool for discovering differences between synonymous words in L2 and differences between supposed L1 and L2 equivalents. Vocabulary ...

Modeling Semantic Compositionality of Croatian Multiword Expressions

A distinguishing feature of many multiword expressions (MWEs) is their semantic non-compositionality. Determining the semantic compositionality of MWEs is important for many natural language processing tasks. We address the task of modeling semantic compositionality of Croatian MWEs. We adopt a composition-based approach within the distributional semantics framework. We build and evaluate model...

Determining the Semantic Compositionality of Croatian Multiword Expressions

A distinguishing feature of many multiword expressions (MWEs) is their semantic non-compositionality. Being able to automatically determine the semantic (non-)compositionality of MWEs is important for many natural language processing tasks. We address the task of determining the semantic compositionality of Croatian MWEs. We adopt a composition-based approach within the distributional semantics...

Méthode semi-compositionnelle pour l'extraction de synonymes des termes complexes

Automatic synonyms and semantically related word extraction is a challenging task, useful in many NLP applications such as question answering, search query expansion, text summarization, etc. While different studies addressed the task of word synonym extraction, only a few investigations tackled the problem of acquiring synonyms of multi-word terms (MWT) from specialized corpora. To extract pai...

Semi-compositional Method for Synonym Extraction of Multi-Word Terms

Automatic synonyms and semantically related word extraction is a challenging task, useful in many NLP applications such as question answering, search query expansion, text summarization, etc. While different studies addressed the task of word synonym extraction, only a few investigations tackled the problem of acquiring synonyms of multi-word terms (MWT) from specialized corpora. To extract pai...

Publication date: 2012